DOCUMENT RESUME

ED 441 404                                              IR 057 675

AUTHOR          Aliprand, Joan M.
TITLE           Cataloguing in the Universal Character Set Environment:
                Looking at the Limits.
PUB DATE        1999-08-00
NOTE            9p.; In: IFLA Council and General Conference. Conference
                Programme and Proceedings (65th, Bangkok, Thailand, August
                20-28, 1999); see IR 057 674.
AVAILABLE FROM  For full text:
                http://www.ifla.org/IV/ifla65/papers/079-155e.htm.
PUB TYPE        Reports - Descriptive (141) -- Speeches/Meeting Papers (150)
EDRS PRICE      MF01/PC01 Plus Postage.
DESCRIPTORS     *Alphabets; *Bibliographic Records; *Cataloging; Second
                Languages; *Standards
IDENTIFIERS     Anglo American Cataloging Rules 2 Revised; MARC;
                *Transcription; *Unicode


Abstract






A new era for multilingual, multiscript computing is beginning, due to the development of the 
Unicode Standard and International Standard ISO/IEC 10646. The character content in these 
publications is kept carefully synchronized. A major milestone has now been reached. With the 
addition of Ethiopic, Mongolian and Sinhala, all of the world's major scripts are covered. 

Cataloguers may expect that such an extensive character repertoire will meet all their needs 
for exact transcription of bibliographic data. This paper examines the topic of exact 
transcription, and situations where it is not applied currently. The conceptual structure 
underpinning the character repertoire of the Unicode Standard and ISO/IEC 10646 is 
explained, followed by a discussion of whether the use of simple strings of characters can meet 
all needs for exact transcription. 



Paper 

My first job was as a cataloguer, and though I'm now a systems analyst, I've maintained an 
active interest in the field. When I was learning about cataloguing in library school, the first 
edition of the Anglo-American Cataloguing Rules, the first rules based on the International 
Cataloguing Principles, was about to be published. I thought that this was the last word on 
cataloguing, and not much more could be said. How wrong I was! And little did I dream that I 
would be contributing to the ongoing dialogue. 

The focus of my presentation is descriptive cataloguing; chiefly the items that used to be 
called the "body" of the entry. Although I am focussing on descriptive cataloguing, some of 
what I say may be applicable generally, i.e., to all parts of bibliographic records, and even to 
other types of library records. 

In my presentation, I'll refer to AACR2.[1] Now, I know that AACR2 isn't used everywhere. 
However, because I come from an English-speaking environment, these are the rules I know 
about. In addition, AACR2 has had an unusually broad influence: both directly and indirectly. 
Its direct influence has been through translations into other languages to serve as the basis for 
other cataloguing rules. It has indirect influence whenever one of the very large number of 
records created in the English-speaking world is used for copy cataloguing. Even when 
English is not the language of cataloguing, the information transcribed from the source of 
information might be useful and save time. 

Rule 1.0E of AACR2, Language and script of the description, states in part: 

In the following areas, give information transcribed from the item itself in the language 
and script (wherever practicable) in which it appears there: 

Title and statement of responsibility 
Edition 
Publication, distribution, etc. 
Series 

Replace symbols or other matter that cannot be reproduced by the typographical 
facilities available with a cataloguer's description in square brackets. Make an 
explanatory note if necessary. 
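
The substitution that this rule describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `transcribe` and the `BASIC_LATIN` repertoire are hypothetical, and the Unicode character name is used as a stand-in for the description a cataloguer would actually write.

```python
import unicodedata

def transcribe(text, supported):
    """Replace each character outside `supported` with a bracketed description.

    The Unicode character name stands in for the cataloguer's own wording.
    """
    out = []
    for ch in text:
        if ch in supported or ch.isspace():
            out.append(ch)
        else:
            # The second argument is returned when the character has no name.
            name = unicodedata.name(ch, "unidentified symbol").lower()
            out.append("[" + name + "]")
    return "".join(out)

# Hypothetical "typographical facilities": printable ASCII only.
BASIC_LATIN = {chr(c) for c in range(0x20, 0x7F)}

print(transcribe("I \u2665 cataloguing", BASIC_LATIN))
# I [black heart suit] cataloguing
```

A real system would take its repertoire from the character sets the catalogue actually supports, and the bracketed wording would remain a cataloguer's decision.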

The main topic I want to examine is transcription in the new computing environment brought 
about by the Unicode Standard and International Standard ISO/IEC 10646. These 
publications cover not just the writing systems for all the major languages of the world, but 
collections of symbols and other elements of text, e.g., mathematical operators, Braille, 
punctuation, "dingbats", etc. Great care is taken to keep their character repertoires 
synchronized. 

I also want to examine the issue of faithful transcription, what I call "exactitude" of 
cataloguing. Throughout, I will mention effects on retrieval, especially intersystem searching, 
which we must bear in mind as we make cataloguing decisions. 

Now it was possible to have automated support for multiple scripts before the Unicode 
Standard and ISO/IEC 10646 - RLIN's scripts started with CJK in 1983, and East Asian 
standards have always included several scripts - but with the availability of Unicode-based 
products, multiscript implementation is easier. 

The Unicode Standard and ISO/IEC 10646 provide a much larger repertoire of scripts and 
characters than are currently authorized for any library application, including USMARC and 
UNIMARC. The expansion of the script repertoire means not only access to scripts that you 
never had before, but more characters for existing scripts. Here is a comparison for characters 
in several scripts. 

Script                 Character category             USMARC/UNIMARC        JIS X 0208  Unicode Standard, Version 3.0
Cyrillic               Letters                        102                   66          237
Latin                  Additional unaccented letters  21                    0           163
Arabic                 Letters                        124                   none        141
East Asian ideographs  Ideographs                     13,469 (86% of EACC)  6,353       27,484

But please don't assume that the Unicode Standard and ISO/IEC 10646 will do everything for 
transcription: 

1. Not everything that you may see on a source of information is in their repertoires. 

2. Not everything you think you need for transcription can be in their repertoires. 

3. Certain scripts require additional implementation support and extended fonts for correct 
presentation. 

Which is not to say that you should reject these standards - I just want you to understand 
reality. 

What’s not there 

The good news is that, with the addition of Sinhala, Ethiopic and Mongolian, all the major 
scripts of the world are now encoded. Version 3.0 of the Unicode Standard is to be published 
later this year, and the second edition of ISO/IEC 10646 is scheduled for next year. 

Growth of the repertoire has not ended: various scripts for minority languages are still 
outstanding, more symbols could be added, and significant extinct scripts such as 
hieroglyphics and cuneiform are pending. (There may not be many libraries which collect and 
catalogue papyri and clay tablets, but the extinct scripts are significant for scholarship in 
general and certain museums in particular.) 

A single font for even the current Unicode character repertoire would be very large, and it's 
more practical to have fonts only for the scripts your library has in its collections. What is 
more likely to occur as you catalogue is not lack of a script, but lack of a particular character, 
e.g., if the title of a work on mathematics includes a symbol that isn't in the Mathematical 
Operators block. So occasionally you can't transcribe 100% of what is on the source of 
information. 
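
As a rough illustration of checking for "lack of a particular character", the following Python sketch tests whether a symbol falls in the Mathematical Operators block (U+2200 to U+22FF) and whether a code point is assigned at all. The helper names are hypothetical, and the answer depends on the version of the Unicode database bundled with the Python interpreter, not on Version 3.0 of the standard.

```python
import unicodedata

# Code point range of the Mathematical Operators block.
MATH_OPERATORS = range(0x2200, 0x2300)

def in_math_operators(ch):
    """True if the character lies in the Mathematical Operators block."""
    return ord(ch) in MATH_OPERATORS

def is_assigned(ch):
    """True if the code point is assigned in this Python's Unicode database.

    General category 'Cn' marks code points with no assigned character.
    """
    return unicodedata.category(ch) != "Cn"

print(in_math_operators("\u2211"))  # N-ARY SUMMATION
print(in_math_operators("+"))       # PLUS SIGN is in Basic Latin instead
print(is_assigned("\u2211"))
```

Note that mathematical symbols also occur outside this one block, so a negative result from `in_math_operators` does not by itself mean the symbol is unavailable.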

But, you protest, I thought the Universal Character Set would have everything that I could 
possibly need! The response is no, for various reasons. 

• The thing that you see on the source of information is an extremely rare character, so 
was simply missed; 

• The thing that you see is known, and is being studied for possible addition; 

• The thing that you see is known, but is not regarded as a character according to the 
Unicode design principles. 

Two Unicode design principles are particularly significant in determining what should be 
encoded as a character: Characters, not glyphs and Unification across languages. In addition, 
the Unified Repertoire and Ordering of Han ideographs ("Unified Han"), developed by the 
Ideographic Rapporteur Group, has rules which determine uniqueness for ideographs. 

Characters, not glyphs means that some high-level typographical aspects are not significant 
when it comes to determining the character repertoire. Examples of typographical aspects are: 

• The naskhi style of Arabic writing versus the nastaliq style; 

• Different ways of writing an East Asian ideograph; 

• Different ways of writing a Cyrillic letter in particular languages; 

• Contractions, typographical digraphs, etc. 

Unification across languages means that: 

• The graphemes used to write a particular language (e.g., an alphabet) are not separately 
encoded; 

• Different language-based ways of writing a letter or ideograph are not encoded as 
separate characters. 

These design principles and rules determine what is to be uniquely encoded. And as a result, 
not everything that appears on a source of information is eligible to be directly encoded as a 
defined character. This limitation on what can be encoded directly as defined characters is not 
a failure of the Unicode Standard. It comes about because of a different and more sophisticated 
vision of what should be encoded in a character set. 

The original approach to the representation of text in machine-readable form was to give a 
unique code to each discrete mark on paper, although there was unification for generally 
accepted cases (the lower case forms of Latin letters a and g, for example). Character sets for 
East Asian languages assigned individual codes to different ways of writing what is 
fundamentally the same ideograph. Library character sets generally exhibit this "encode what 
you see" approach too, except for the use of non-spacing marks to encode accented Latin 
letters, where a letter with a diacritical mark is encoded as two characters. (Critics would say 
the letter is "broken apart.") 

The Unicode Standard introduced a layered approach to the representation of text. "The design 
for a character set encoding must provide precisely the set of code elements that allows 
programmers to design applications capable of implementing a variety of text processes in the 
desired languages." One result is that the characters in encoded text do not necessarily 
correspond 1:1 with the elements of that text in eye-readable form. 
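
The lack of 1:1 correspondence can be seen with an accented Latin letter, which Unicode allows to be encoded either as one precomposed character or as a base letter followed by a non-spacing (combining) mark. The Python sketch below is only an illustration; the variable names are arbitrary, and normalization is shown as one way a system might reconcile the two encodings.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# Both display as the same letter, but the character counts differ.
print(len(precomposed), len(decomposed))   # 1 2

# A naive comparison treats them as different strings.
print(precomposed == decomposed)           # False

# Normalizing both to the same form makes them comparable.
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
```

For retrieval, this suggests that the index and the search argument should be brought to the same normalization form before they are compared.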

The simplest type of text representation is plain text: a pure sequence of character codes. 
Unicode data is plain text. But to render what is wanted exactly, it may be necessary to use 
higher level protocols, such as language identification or layout instructions, to produce fancy 
text or rich text. USMARC and UNIMARC also use only plain text, but their character sets 
may provide separate encodings for things that are unified in Unicode/ISO 10646. 

So we need to consider these issues: 

• How exact must we be in transcription? 

• If we have to be ultra-exact, how can we achieve this when we use Unicode/10646? 

Evaluation of exactitude of transcription 




So this brings us to consider the issue of exactitude of transcription. How exact does 
transcription have to be? Why? What exceptions do we make (perhaps without conscious 
decision-making)? What "work-arounds" do we use when we don't have the necessary 
typographical facilities? 



We need exactitude in transcription in order to represent the item being identified uniquely and 
to make it accessible. Notice, however, that we don't always transcribe the information from 
the item with 100% fidelity. 

One reason for the lack of fidelity is that cataloguing rules or the interpretation of them by a 
cataloguing agency do not always require, and sometimes do not allow, specific data to be 
transcribed. Here's an example. The Hebrew language is normally written unvocalized, that is, 
without vowel points and other marks of pronunciation. But sometimes these pronunciation 
guides are printed on the source of information; for example, when the author or publisher 
wants a word to be pronounced in an uncommon way. The Library of Congress, in its 
guidelines for cataloguing Hebrew,[15] builds on Rule 1.0G, Accents and other diacritical 
marks, and interprets it (incorrectly, in my opinion) as forbidding the transcription of 
vocalization marks that appear on the source of information. 
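
If vocalization marks were transcribed, a system could still match them against unvocalized search arguments by removing the points at indexing time. The Python sketch below assumes this approach purely for illustration; the function name `strip_points` is hypothetical, and the range used is the run of points and accents in the Unicode Hebrew block.

```python
import unicodedata

def strip_points(text):
    """Remove Hebrew points and accents, leaving the consonantal text.

    Only non-spacing marks (category 'Mn') in U+0591..U+05C7 are removed,
    so punctuation such as maqaf and marks in other scripts are kept.
    """
    decomposed = unicodedata.normalize("NFD", text)
    kept = [
        ch for ch in decomposed
        if not (0x0591 <= ord(ch) <= 0x05C7
                and unicodedata.category(ch) == "Mn")
    ]
    # Recompose so that text in other scripts is returned unchanged.
    return unicodedata.normalize("NFC", "".join(kept))

# The word "shalom" with vowel points, written with escapes for clarity.
vocalized = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"
print(strip_points(vocalized) == "\u05e9\u05dc\u05d5\u05dd")   # True
```

With this kind of normalization applied only to the search index, the record itself could keep the vocalization exactly as it appears on the source of information.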

One exception to exactitude is necessitated by lack of typographical facilities, a problem 
recognized in Rule 1.0E. The solution that this rule allows is the description of an unavailable 
textual element. This introduces an issue for intersystem searching - should the interpolation 
be ignored in searching, or treated as a "wild card" that matches anything, or ...? The user 
cannot be expected to know the exact description written by the cataloguer. 
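
The "wild card" option can be sketched as follows. This is a minimal illustration, not a description of how any existing system behaves: the function names are hypothetical, and it assumes that every bracketed span in the transcribed string is a cataloguer's interpolation, which is not always true since square brackets have other uses in cataloguing.

```python
import re

def title_pattern(transcribed):
    """Build a pattern in which each [bracketed] span matches anything."""
    parts = re.split(r"\[[^\]]*\]", transcribed)
    # Escape the literal text and join it with a non-greedy wildcard.
    pattern = ".*?".join(re.escape(part) for part in parts)
    return re.compile(pattern, re.IGNORECASE)

def matches(transcribed, query):
    """True if the query equals the transcription, ignoring interpolations."""
    return title_pattern(transcribed).fullmatch(query) is not None

record_title = "I [heart symbol] New York"
print(matches(record_title, "I love New York"))      # True
print(matches(record_title, "I \u2665 New York"))    # True
print(matches(record_title, "We love New York"))     # False
```

Treating the interpolation as ignorable instead would also require normalizing the surrounding spaces, which is one reason the choice between the two options needs to be agreed between systems.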

There are also unwritten rules for exceptions to exactitude. Except for antiquarian and other 
precious books, we routinely ignore font features, calligraphy, etc. when transcribing, without 
any attempt to note such features. This is based on practicality, since for most modern works, 
distinctions at a very detailed level aren't needed. 

When typographical facilities for a whole script are lacking, there are various options. When 
the language of cataloguing uses Latin script, the chosen solution is often romanization: 
transliteration or transcription into a Latin script form of the original text. Wellisch reported 
in 1976 that the LC romanization tables (now ALA/LC) were most widely used, followed by 
those of ISO. When the language of cataloguing is Russian or another language written in 
Cyrillic script, cyrillicization is sometimes done. But not all languages use an alphabet or 
syllabary, and other solutions are to translate the information into the local language, or 
maintain card catalogues by script. 

Access is impeded by all these alternatives. Where a library uses romanization or 
cyrillicization, the searcher must know that fact, know which conversion scheme is used for a 
particular language, and be able to apply that scheme correctly to create a search argument. A 
searcher may not know about the library's practice and use a completely different scheme. For 
translations, the searcher's translation may not match that of the cataloguer. Card catalogues, 
unless they have been published in book form, cannot be searched remotely. 

Lack of coded characters? 

These problems will be alleviated considerably through the introduction of Unicode/ISO 
10646 into USMARC and UNIMARC. But the use of a greatly expanded script repertoire does 
not mean that everything may be transcribed exactly. I now want to look at situations where 
even Unicode/ISO 10646 won't bring about 100% fidelity. 

Historically, a primary reason for exactitude in transcription was to provide a surrogate of the 
bibliographic entity with as much detail as possible. The detail was needed because we had no 
other way to present the item in a card or book catalogue. 

Problems of exact transcription are usually pointed out for ideographs, but this is not 
exclusively the case. If you're cataloguing a sound recording, what do you do about the name 
symbol used by "the artist formerly known as Prince"? 

One source of difficulty is mathematics, where 2-dimensional formulas must be forced into a 
1-dimensional field. Sargent has described how to represent mathematical formulas using 
Unicode. 

Problems with ideographs arise either because the ideograph is not yet encoded, or because 
variant forms of an ideograph are represented by a single coded value (as noted by Zhang & 
Zhen). Unavailable ideographs include both truly unique ideographs (used for personal 
names) and those in common use in a particular environment but not yet in Unified Han 
(e.g., some of the government-sanctioned ideographs used in Hong Kong, or ideographs 
occurring in geographic names). In this situation: 

• The geta symbol can be substituted for the unavailable ideograph. The geta comes from 
Japanese typography and is a placeholder for an ideograph to be supplied later. The 
technique is used in USMARC records. 

• Ideographic description characters are intended to help the user visualize the unavailable 
character. Version 3.0 of The Unicode Standard and the second edition of ISO/IEC 
10646 include these characters. 
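
Both techniques can be illustrated with a short Python sketch. The geta is encoded as U+3013 GETA MARK, and the ideographic description characters of Version 3.0 occupy U+2FF0 to U+2FFB. Everything else here is an assumption made for the example: the helper names, the idea of a local `repertoire` set, and the restriction to the main CJK Unified Ideographs block.

```python
import unicodedata

GETA = "\u3013"  # GETA MARK

def is_unified_ideograph(ch):
    """True for the main CJK Unified Ideographs block (extensions ignored)."""
    return 0x4E00 <= ord(ch) <= 0x9FFF

def with_geta(text, repertoire):
    """Substitute the geta for ideographs missing from a local repertoire."""
    return "".join(
        GETA if is_unified_ideograph(ch) and ch not in repertoire else ch
        for ch in text
    )

def is_description_operator(ch):
    """True for the ideographic description characters U+2FF0..U+2FFB."""
    return 0x2FF0 <= ord(ch) <= 0x2FFB

# A description sequence: left-to-right operator, then the two components
# U+6C35 (water radical form) and U+9752, which together describe U+6E05.
description = "\u2ff0\u6c35\u9752"
print(unicodedata.name(description[0]))
# IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT

# Hypothetical repertoire that has U+6C34 but lacks U+6E05.
print(with_geta("\u6e05\u6c34", {"\u6c34"}) == "\u3013\u6c34")   # True
```

A description sequence helps a reader picture the missing ideograph, but it remains a sequence of several characters, so it will not match a search for the ideograph itself.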

When a particular typographic form has been unified with others, yet the cataloguer wants to 
use only that particular form, these are possible solutions. 

• Use a higher level protocol, e.g., SGML mark-up, to insist that this character be 
presented in a particular writing style. (Since both USMARC and UNIMARC use plain 
text, this option is outside their current scope.) 

• Present the ideographic data in the record using a font determined by the language and 
country codes in the record. For example, if the language code was chi and the code for 
the country of publication was cc, the font would be a simplified Chinese style. If the 
language code was jpn, the font should be one with typical kanji. (This option will only 
work where the coded information is unequivocal, and when the ideographs appearing 
on the work are consistent with the preferred form for the language of the work and 
place of publication.) 

• The Unicode Technical Committee has been considering a proposal by which 
ideographic variants can be indicated in plain text. Perhaps this will provide a solution. 
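
The second option above, choosing a font from the coded language and country in the record, might be sketched like this. The codes `chi`, `cc` and `jpn` come from the example in the text; the font style names, the table `FONT_BY_CODES` and the function `choose_font` are hypothetical placeholders, not part of any MARC specification.

```python
# Keys are (language code, country code); None means "any country".
FONT_BY_CODES = {
    ("chi", "cc"): "simplified-chinese",
    ("jpn", None): "japanese-kanji",
}

def choose_font(language, country=None, default="generic-han"):
    """Pick a display font style from the record's coded information.

    Tries the exact language and country pair first, then the language
    alone, and otherwise falls back to a default style.
    """
    for key in ((language, country), (language, None)):
        if key in FONT_BY_CODES:
            return FONT_BY_CODES[key]
    return default

print(choose_font("chi", "cc"))   # simplified-chinese
print(choose_font("jpn", "ja"))   # japanese-kanji
print(choose_font("chi", "ch"))   # generic-han
```

As the text cautions, a lookup like this is only as good as the coded data: it gives the wrong style whenever the ideographs on the item do not follow the usual form for that language and place of publication.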

Preferred regional or language forms are not exclusive to ideographs. When the Urdu language 
is written in Arabic script, it is conventionally printed in the nastaliq style. The Arabic 
language is usually printed in the naskhi style. (Naskhi is the style of the font used in RLIN's 
implementation of Arabic script.) Since all of the information on the work will normally be in 
the same typographic style, this can be highlighted through a note, when the typographic style 
on the item is not the same as that of the system. This is a situation similar to the black letter 
and Fraktur typographic styles of European printing. 

A general solution to the problem of inexact transcription in bibliographic records is to use 
hyperlinking. In a Web-based catalogue, we can have a link to a picture (scanned image) of the 
actual source of information. The disadvantage of a scanned image is that it cannot be 
searched for a specific occurrence of a particular glyphic form, but this is an operation that is 
more likely to be applied to full text than to cataloguing. 

Conclusion 

The editors of cataloguing rules should review the rules on transcription to determine whether 
changes are needed due to the new technical environment. The new technical environment 
includes not only use of Unicode/ISO 10646 but also the ability to search remote catalogues 
via Z39.50. 

Those in charge of the various MARC formats have to work with cataloguers to determine 
whether it is necessary to re-evaluate the "plain text" of the current formats. It isn't just a case 
of declaring Unicode/ISO 10646 as an approved character set (as has been done for 
UNIMARC) or specifying the necessary changes in detail (as is underway for both 
USMARC and UNIMARC). That is the first and essential step, but cataloguing requirements 
may call for something beyond the "plain text" of the Unicode Standard and ISO/IEC 10646. 
If this is a requirement, then the various MARC formats will need to specify a methodology to 
provide this. 

The question that has to be answered is: Is cataloguing data "plain text" or does it need to be a 
little fancier? 